    Language Modeling Approaches to Information Retrieval

    This article surveys recent research in language modeling (sometimes called statistical language modeling) approaches to information retrieval. Language modeling is a formal probabilistic retrieval framework with roots in speech recognition and natural language processing. The underlying assumption of language modeling is that human language generation is a random process; the goal is to model that process via a generative statistical model. In this article, we discuss current research on the application of language modeling to information retrieval, the role of semantics in the language modeling framework, cluster-based language models, the use of language modeling for XML retrieval, and future trends.
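
    To make the framework concrete, the sketch below scores documents by query likelihood under a smoothed unigram language model, a standard instance of this retrieval approach; the toy corpus, the Jelinek-Mercer interpolation, and the lambda value are illustrative assumptions rather than details taken from the article.

    # Minimal sketch of query-likelihood retrieval with Jelinek-Mercer smoothing.
    # The toy corpus and the lambda value are illustrative assumptions.
    import math
    from collections import Counter

    def score(query_terms, doc_terms, collection_counts, collection_len, lam=0.7):
        """log P(query | document) under a smoothed unigram language model."""
        doc_counts = Counter(doc_terms)
        doc_len = len(doc_terms)
        log_p = 0.0
        for t in query_terms:
            p_doc = doc_counts[t] / doc_len if doc_len else 0.0
            p_coll = collection_counts[t] / collection_len
            p = lam * p_doc + (1 - lam) * p_coll  # interpolate document and collection models
            log_p += math.log(p) if p > 0 else float("-inf")
        return log_p

    docs = {"d1": "language models for retrieval".split(),
            "d2": "speech recognition with statistical models".split()}
    coll = Counter(t for d in docs.values() for t in d)
    coll_len = sum(coll.values())
    query = "statistical language models".split()
    print(sorted(docs, key=lambda d: score(query, docs[d], coll, coll_len), reverse=True))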

    Focused multi-document summarization: Human summarization activity vs. automated systems techniques

    Focused Multi-Document Summarization (MDS) is concerned with summarizing the documents in a collection with a concentration on a particular external request (i.e., a query, question, or topic), or focus. Although the current state of the art provides reasonably good performance on DUC/TAC-style evaluations (i.e., government and news concerns), other considerations need to be explored. This paper not only briefly reviews the state of the art in automated system techniques but also compares these techniques with human summarization activity.

    Using semantic similarity to improve user modeling in web personalization systems

    Personalization is a process by which users are presented with web resources customized to their interests. Critical to the personalization process is the user model, which is the system's representation of the user's characteristics and preferences. Current web personalization systems traditionally use keywords extracted from the content of visited pages as the basis of the user model. This keyword extraction technique, based on the vector space model, does not consider the semantics of the content, which could be used to improve the characterization of the user's preferences; terms that are semantically related, such as car and vehicle, are treated separately in a keyword-based approach. In this study, we propose a method to improve user modeling in web personalization systems by incorporating the semantics of the content. To achieve that, we map keywords extracted from the content of web pages to concepts in a domain ontology. The mapping is based on semantic similarity between terms in the WordNet taxonomy.
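
    As one concrete illustration of the mapping step, the sketch below uses WordNet-based Wu-Palmer similarity to link an extracted keyword to the closest ontology concept; the concept list, the example keyword, and the 0.8 threshold are assumptions for illustration and are not taken from the study.

    # Sketch of keyword-to-concept mapping via WordNet similarity; requires the
    # NLTK WordNet corpus to be downloaded. Concepts, keyword, and threshold are
    # illustrative assumptions.
    from nltk.corpus import wordnet as wn

    def best_concept(keyword, concepts, threshold=0.8):
        """Return the ontology concept most similar to the keyword, if similar enough."""
        kw_synsets = wn.synsets(keyword, pos=wn.NOUN)
        best, best_sim = None, 0.0
        for concept in concepts:
            for c_syn in wn.synsets(concept, pos=wn.NOUN):
                for k_syn in kw_synsets:
                    sim = k_syn.wup_similarity(c_syn) or 0.0  # Wu-Palmer taxonomy similarity
                    if sim > best_sim:
                        best, best_sim = concept, sim
        return best if best_sim >= threshold else None

    print(best_concept("car", ["vehicle", "building", "person"]))  # expected: vehicle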

    An infrastructure of stream data mining, fusion and management for monitored patients

    Paper presented at the 19th IEEE International Symposium on Computer-Based Medical Systems, CBMS 2006, Salt Lake City, UT. This paper proposes an infrastructure for data mining, fusion, and patient care management using continuous stream data monitored from critically ill patients. Stream data mining, fusion, and management provide efficient ways to increase data utilization and to support knowledge discovery, which can be applied in many clinical areas to improve the quality of patient care services. The primary goal of our work is to establish a customized infrastructure model designed for critical care services at hospitals; however, this structure can easily be extended to other clinical specialties.
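
    The abstract does not give implementation details, so the following is only a minimal sketch, under assumed signal names and window sizes, of how windowed readings from several vital-sign streams might be fused into one snapshot per patient.

    # Illustrative sliding-window fusion over monitored vital-sign streams.
    from collections import defaultdict, deque

    class VitalSignFusion:
        """Keep a sliding window per signal and emit fused (mean) readings."""
        def __init__(self, window_size=10):
            self.windows = defaultdict(lambda: deque(maxlen=window_size))

        def push(self, signal, value):
            self.windows[signal].append(value)

        def fused_snapshot(self):
            # One fused record across all monitored signals (heart rate, SpO2, ...).
            return {s: sum(w) / len(w) for s, w in self.windows.items() if w}

    fusion = VitalSignFusion(window_size=5)
    for hr, spo2 in [(82, 97), (85, 96), (90, 95)]:
        fusion.push("heart_rate", hr)
        fusion.push("spo2", spo2)
    print(fusion.fused_snapshot())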

    Biomedical text annotation and summarization

    Advancements in the biomedical community are largely documented and published in text form in scientific forums such as conferences and journals. To address the scalability of utilizing the large volume of text-based information generated by continuing advances in the biomedical field, we propose a two-part text summarization system to reduce the volume of text that must be read by biomedical professionals. The contributions of the text summarization system are that (a) it utilizes biomedical concepts rather than terms to find the main points of a text, (b) it uses two new concept-based algorithms to find important areas of a text from which to extract sentences to form a summary, and (c) it is supported by a new semantic annotation subsystem, which identifies biomedical concepts found in biomedical text documents. The semantic annotation subsystem uses a novel multiple-filter system architecture for online matching of concepts defined by a biomedical metathesaurus. The goal of semantic annotation is to show that online text-to-concept mapping can be performed without a significant loss of precision compared to current offline systems. An evaluation shows that the text summarization algorithms using concepts outperform existing summarization systems, and that the semantic annotation system performs twenty times faster than a state-of-the-art system with no significant loss of precision.
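
    The sketch below illustrates the concept-based scoring idea in miniature: sentences are scored by the corpus-wide frequency of the concepts they mention. The tiny concept dictionary stands in for a biomedical metathesaurus and, like the scoring rule, is an assumption rather than the system's actual algorithm.

    # Concept-based sentence scoring in miniature; the concept dictionary is a
    # stand-in for a metathesaurus.
    import re
    from collections import Counter

    CONCEPTS = {"heart attack": "C_MI", "myocardial infarction": "C_MI", "aspirin": "C_ASA"}

    def concepts_in(sentence):
        text = sentence.lower()
        return [cid for phrase, cid in CONCEPTS.items() if phrase in text]

    def summarize(text, n_sentences=1):
        sentences = re.split(r"(?<=[.!?])\s+", text.strip())
        concept_freq = Counter(c for s in sentences for c in concepts_in(s))
        # Rank sentences by the overall frequency of the concepts they mention.
        ranked = sorted(sentences,
                        key=lambda s: sum(concept_freq[c] for c in concepts_in(s)),
                        reverse=True)
        return ranked[:n_sentences]

    doc = ("Aspirin is often given after a heart attack. "
           "Myocardial infarction outcomes improve with early treatment. "
           "The weather was unremarkable.")
    print(summarize(doc))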

    Converting semi-structured clinical medical records into information and knowledge

    Proceedings of the 21st International Conference on Data Engineering, ICDE 2005, pp. 1647765. Clinical medical records contain a wealth of information, largely in free-text form. Thus, means to extract structured information from free-text records become an important research endeavor. In this paper, we propose and implement an information extraction system that extracts three types of information from semi-structured patient records: numeric values, medical terms, and categorical values. Three approaches are proposed to solve the problems posed by each of the three types of values, and very good performance (precision and recall) is achieved. A novel link-grammar-based approach was developed to associate a feature with its number in a sentence, achieving extremely high accuracy. A simple but efficient approach, using POS-based patterns and a domain ontology, was adopted to extract medical terms of interest. Finally, an NLP-based feature extraction method coupled with an ID3-based decision tree is used to classify and extract categorical cases. This preliminary approach to categorical fields has, so far, proven to be quite effective.
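
    As a hedged illustration of the categorical-value step, the sketch below trains an ID3-style (entropy-criterion) decision tree over bag-of-words features; the snippets, labels, and use of scikit-learn are assumptions for illustration, not the authors' implementation.

    # ID3-style (entropy-criterion) decision tree over bag-of-words features for a
    # categorical field; data and library choice are illustrative assumptions.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.tree import DecisionTreeClassifier

    snippets = ["patient denies smoking", "smokes one pack per day",
                "former smoker, quit in 1999", "never smoked"]
    labels = ["non-smoker", "smoker", "former smoker", "non-smoker"]

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(snippets)
    clf = DecisionTreeClassifier(criterion="entropy")  # entropy splits, as in ID3
    clf.fit(X, labels)
    print(clf.predict(vectorizer.transform(["patient smokes daily"])))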

    Approaches to text mining for clinical medical records

    The 21st Annual ACM Symposium on Applied Computing 2006, Technical Track on Computer Applications in Health Care (CAHC 2006), Dijon, France, April 23-27, 2006. Retrieved 6/21/2006 from http://www.ischool.drexel.edu/faculty/hhan/SAC2006_CAHC.pdf. Clinical medical records contain a wealth of information, largely in free-text form. Means to extract structured information from free-text records are an important research endeavor. In this paper, we describe a MEDical Information Extraction (MedIE) system that extracts and mines a variety of patient information from the free-text clinical records of patients with breast complaints. MedIE is part of a medical text mining project being conducted at Drexel University. Three approaches are proposed to solve different IE tasks, and very good performance (precision and recall) was achieved. A graph-based approach, which uses the parsing result of a link-grammar parser, was developed for relation extraction and achieved high accuracy. A simple but efficient ontology-based approach was adopted to extract medical terms of interest. Finally, an NLP-based feature extraction method coupled with an ID3-based decision tree was used to perform text classification.
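
    The graph-based idea can be illustrated by treating a parse as a graph and reading a relation off the shortest path between two entities. The toy parse edges below are hand-built assumptions; the real system derives its graph from a link-grammar parser.

    # Relations read off the shortest path between entities in a toy parse graph.
    import networkx as nx

    # Hand-built stand-in for a link-grammar parse of:
    # "The tumor measures 2.3 cm in the left breast."
    edges = [("tumor", "measures"), ("measures", "2.3 cm"),
             ("measures", "in"), ("in", "breast"), ("breast", "left")]
    graph = nx.Graph(edges)

    def relation_path(graph, entity_a, entity_b):
        """Words on the path linking two entities, a simple cue for their relation."""
        return nx.shortest_path(graph, entity_a, entity_b)

    print(relation_path(graph, "tumor", "2.3 cm"))  # ['tumor', 'measures', '2.3 cm']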

    A Computationally Efficient System for High-Performance Multi-Document Summarization

    We propose and develop a simple and efficient algorithm for generating extractive multi-document summaries and show that this algorithm exhibits state-of-the-art or near state-of-the-art performance on two Document Understanding Conference datasets and two Text Analysis Conference datasets. Our results show that algorithms using simple features and computationally efficient methods are competitive with much more complex methods for multi-document summarization (MDS). Given these findings, we believe that our summarization algorithm can be used as a baseline in future MDS evaluations. Further, evidence shows that our system is near the upper limit of performance for extractive MDS.
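
    The paper's exact features are not listed in this abstract, so the sketch below only illustrates the general "simple features, efficient method" style of extractive MDS with a frequency-driven, SumBasic-like selection loop; it is not the authors' specific algorithm.

    # Frequency-driven extractive MDS with a redundancy-discouraging update.
    import re
    from collections import Counter

    def extract_summary(documents, max_sentences=3):
        sentences = [s for d in documents for s in re.split(r"(?<=[.!?])\s+", d.strip()) if s]
        words = lambda s: re.findall(r"[a-z]+", s.lower())
        freq = Counter(w for s in sentences for w in words(s))
        total = sum(freq.values())
        prob = {w: c / total for w, c in freq.items()}
        summary = []
        for _ in range(min(max_sentences, len(sentences))):
            best = max(sentences,
                       key=lambda s: sum(prob[w] for w in words(s)) / (len(words(s)) or 1))
            summary.append(best)
            sentences.remove(best)
            for w in words(best):  # squaring lowers the weight of words already covered
                prob[w] *= prob[w]
        return summary

    docs = ["Wildfires spread across the region on Monday. Crews contained two fires.",
            "Officials said the wildfires forced evacuations. Rain is expected Tuesday."]
    print(extract_summary(docs, max_sentences=2))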

    From Question Context to Answer Credibility: Modeling Semantic Structures for Question Answering Using Statistical Methods

    Within a Question Answering (QA) framework, Question Context plays a vital role. We define Question Context to be background knowledge that can be used to represent the user's information need more completely than the terms in the query alone. This paper proposes a novel approach that uses statistical language modeling techniques to develop a semantic Question Context, which we then incorporate into the Information Retrieval (IR) stage of QA. Our approach uses an Aspect-Based Relevance Language Model as the basis of the Question Context Model. This model proposes that the sparse vocabulary of a query can be supplemented with semantic information from concepts (or aspects) related to query terms that already exist within the corpus. We incorporate the Aspect-Based Relevance Language Model into the Question Context Model by first obtaining all of the latent concepts that exist in the corpus for a particular question topic. Then, we derive a likelihood of relevance that relates each Context Term (CT) associated with those aspects to the user's query. Context Terms from the aspects with the highest likelihood of relevance are then incorporated into the query language model based on their relevance scores. We apply both query expansion and document model smoothing techniques and evaluate our approach. Our results are promising and show significant improvements in recall with the query expansion method.
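
    The sketch below shows relevance-model-style query expansion, the general mechanism of adding weighted Context Terms to a query language model; the documents, interpolation weight, and term-weighting scheme are illustrative assumptions, not the paper's exact Aspect-Based Relevance Language Model.

    # Relevance-model-style expansion: weight Context Terms by their frequency in
    # pseudo-relevant documents, then interpolate into the query language model.
    from collections import Counter

    def expansion_terms(query_terms, top_docs, n_terms=5):
        counts = Counter(w for doc in top_docs for w in doc if w not in query_terms)
        total = sum(counts.values())
        return [(w, c / total) for w, c in counts.most_common(n_terms)]

    def expanded_query_model(query_terms, context_terms, alpha=0.6):
        """Interpolate the original query model with the Context Term distribution."""
        q_prob = 1.0 / len(query_terms)
        model = {t: alpha * q_prob for t in query_terms}
        for term, weight in context_terms:
            model[term] = model.get(term, 0.0) + (1 - alpha) * weight
        return model

    top_docs = [["solar", "panel", "energy", "photovoltaic"], ["energy", "cost", "solar"]]
    query = ["solar", "energy"]
    print(expanded_query_model(query, expansion_terms(query, top_docs)))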

    Answer Credibility: A Language Modeling Approach to Answer Validation

    Proceedings of NAACL HLT 2009: Short Papers, pages 157–160, Boulder, Colorado, June 2009, Association for Computational Linguistics. Answer Validation is a topic of significant interest within the Question Answering community. In this paper, we propose the use of language modeling methodologies for Answer Validation, using corpus-based methods that do not require the use of external sources. Specifically, we propose a model for Answer Credibility which quantifies the reliability of a source document that contains a candidate answer with respect to the Question's Context Model.
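
    One simple way to realize such a score, shown in the sketch below, is to measure how much of the Question Context Model's probability mass the source document also covers; this illustrates the general idea only, under assumed toy data, and does not reproduce the paper's exact credibility formula.

    # Credibility as the share of the Question Context Model's probability mass that
    # the candidate answer's source document also covers (toy data, assumed formula).
    from collections import Counter

    def unigram_model(tokens):
        counts = Counter(tokens)
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}

    def credibility(source_doc_tokens, context_model):
        doc_model = unigram_model(source_doc_tokens)
        return sum(min(p, doc_model.get(w, 0.0)) for w, p in context_model.items())

    context = unigram_model("mount everest height 8848 metres nepal himalaya".split())
    source = "everest in nepal rises to 8848 metres above sea level".split()
    print(round(credibility(source, context), 3))  # higher means better coverage of the context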